Univariate Plots Section
First, I will look into the dataframe structure.
## 'data.frame': 113937 obs. of 15 variables:
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
To better reflect the nature of the data, I will change the data type of some columns from int to factor (ListingNumber and Term), convert ListingCreationDate from factor to date object, then change the ListingCategory from numeric to factor to make the interpretation more intuitive.
Now let’s look at the summary of the working dataset!
## ListingNumber ListingCreationDate Term
## 951186 : 6 Min. :2005-11-09 12: 1614
## 882888 : 4 1st Qu.:2008-09-19 36:87778
## 892845 : 4 Median :2012-06-16 60:24545
## 1056749: 4 Mean :2011-07-08
## 1057901: 4 3rd Qu.:2013-09-09
## 875616 : 3 Max. :2014-03-10
## (Other):113912
## LoanStatus BorrowerAPR ListingCategory
## Current :56576 Min. :0.00653 DebtConsolidation:58308
## Completed :38074 1st Qu.:0.15629 NotAvailable :16965
## Chargedoff :11992 Median :0.20976 Other :10494
## Defaulted : 5018 Mean :0.21883 HomeImprovement : 7433
## Past Due (1-15 days) : 806 3rd Qu.:0.28381 Business : 7189
## Past Due (31-60 days): 363 Max. :0.51229 Auto : 2572
## (Other) : 1108 NA's :25 (Other) :10976
## BorrowerState Occupation EmploymentStatus
## CA :14717 Other :28617 Employed :67322
## TX : 6842 Professional :13628 Full-time :26355
## NY : 6729 Computer Programmer : 4478 Self-employed: 6134
## FL : 6720 Executive : 4311 Not available: 5347
## IL : 5921 Teacher : 3759 Other : 3806
## : 5515 Administrative Assistant: 3688 : 2255
## (Other):67493 (Other) :55456 (Other) : 2718
## EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
## Min. : 0.00 Min. : 0.0 Min. : 19.0
## 1st Qu.: 26.00 1st Qu.:660.0 1st Qu.:679.0
## Median : 67.00 Median :680.0 Median :699.0
## Mean : 96.07 Mean :685.6 Mean :704.6
## 3rd Qu.:137.00 3rd Qu.:720.0 3rd Qu.:739.0
## Max. :755.00 Max. :880.0 Max. :899.0
## NA's :7625 NA's :591 NA's :591
## DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
## Min. : 0.000 Min. : 0 Min. : 1000
## 1st Qu.: 0.140 1st Qu.: 3200 1st Qu.: 4000
## Median : 0.220 Median : 4667 Median : 6500
## Mean : 0.276 Mean : 5608 Mean : 8337
## 3rd Qu.: 0.320 3rd Qu.: 6825 3rd Qu.:12000
## Max. :10.010 Max. :1750003 Max. :35000
## NA's :8554
It appeared odd to me that there were duplications in the listing number. The listing number should be unique since it represents each individual loan listings so I will take a closer look at some of them.
## ListingNumber ListingCreationDate Term LoanStatus BorrowerAPR
## 13079 951186 2013-10-02 60 Current 0.16662
## 14889 951186 2013-10-02 60 Current 0.16662
## 20570 951186 2013-10-02 60 Current 0.16662
## 31451 951186 2013-10-02 60 Current 0.16662
## 42751 951186 2013-10-02 60 Current 0.16662
## 42752 951186 2013-10-02 60 Current 0.16662
## ListingCategory BorrowerState Occupation EmploymentStatus
## 13079 DebtConsolidation MD Other Employed
## 14889 DebtConsolidation MD Other Employed
## 20570 DebtConsolidation MD Other Employed
## 31451 DebtConsolidation MD Other Employed
## 42751 DebtConsolidation MD Other Employed
## 42752 DebtConsolidation MD Other Employed
## EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
## 13079 26 720 739
## 14889 26 720 739
## 20570 26 720 739
## 31451 26 720 739
## 42751 26 720 739
## 42752 26 720 739
## DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
## 13079 0.41 3000 10000
## 14889 0.41 3000 10000
## 20570 0.41 3000 10000
## 31451 0.41 3000 10000
## 42751 0.41 3000 10000
## 42752 0.41 3000 10000
After checking the top three listings (shown above is the first example), it seems that they are indeed duplications of the same listing so I will keep only the unique loan listings which add up to 113066 observations.
Now let’s look at individual variables.

## Min. 1st Qu. Median Mean 3rd Qu.
## "2005-11-09" "2008-09-13" "2012-06-08" "2011-07-02" "2013-08-31"
## Max.
## "2014-03-10"
The listings range from November 2005 to March 2014. There is a gap at year 2009 in which there were no listings at all, splitting the data into two parts. There is a sharp increase in the number of listings in 2013.

## 12 36 60
## 1614 87224 24228
The majority of the loans is of 36-month term (87224) followed by 60-month term (24228) and 12-month term (1614), indicating that the loan dataset is comprised of short term loans.


The majority of the loans are either current or completed with remaining loans that are mainly charged off, defaulted or past due (Top plot). I decided to create a simplified column with only three levels: “Current”, “Completed” and “Bad_loan” (Bottom plot).

The most common reason for taking out a loan in the dataset is debt consolidation. There is also a fraction of loans that does not provide a concrete reason (Not available or other). Home improvement and business are also main listing categories in this dataset.

It looks like California has the most of the loan listings followed by Texas, Florida and New York.

This is a bit of a busy plot at the x-axis but it nicely demonstrates all flavors of occupation the borrowers had during loan application. Although the highest level is “other” and does not provide too much information, we can see that the top three occupations are professionals, computers programmers and executives.

I am quite confused with the classification of this category. More than half of the borrowers are employed, although from the choices (levels) of the employment status it is not clear what is the difference between “employed” vs. “full-time” or “Part-time’ so the proportion might be different depending on the interpretation. Overall most borrowers are employed.
I suspect that the choices in this category was changed at some point so I decide to look at the time distribution of the EmploymentStatus categories (see the chart below).

Unfortunately, the result is still confusing to me. For example, “Employed” and “Full-time” are two overlapping classifications to me and I was hoping to find that the distribution of the counts for these two classifications would be distinct (ie. one is collect at a period that is not overlapped to the other). However, it seemed that in the dataset both of them are used to classify loans that are collected at the same time and I have no further distinction on how the choice was made.
Next I will look at distribution of the numeric columns starting with APR.

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15630 0.20980 0.21900 0.28390 0.51230 25
The APR histogram appears bimodal and has a broad, normally distributed first peak at 0.2%, near the median(red) and mean(green), and a sharp second peak at 0.36%.
There are 25 listings that are from the very beginning of the dataset that don’t have APR information. I will remove those listings since my focus is on characterizing the APR.

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.06 137.00 755.00 7600
Employment status duration for the borrowers range from 0 month to 755 months. The original distribution was positively skewed with a long tail to the right (Top plot). Transformation by taking log10 seems to result in a slightly negatively skewed distribution (Middle plot). Transformation of the data by square rooting seems to reduce the longtailness (Bottom plot).

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3198 4667 5604 6812 1750000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.505 3.669 3.615 3.833 6.243
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 56.55 68.31 70.66 82.54 1323.00
StatedMonthlyIncome also shows positive skewness (first summary) but can be tranformed to normal distribution by taking log10 (second summary) or square root (third summary).
I am puzzled by the large number of maximum StatedMonthlyIncome so decided to take a closer look.
## ListingNumber ListingCreationDate Term LoanStatus BorrowerAPR
## 53015 694401 2013-01-04 12 Completed 0.25785
## 53168 560783 2012-02-20 36 Current 0.35797
## ListingCategory BorrowerState Occupation EmploymentStatus
## 53015 Business CA Other Self-employed
## 53168 Business OR Professional Self-employed
## EmploymentStatusDuration CreditScoreRangeLower CreditScoreRangeUpper
## 53015 117 720 739
## 53168 114 740 759
## DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount
## 53015 NA 618547.8 4000
## 53168 NA 1750002.9 4000
It appears that for a StatedMonthlyIncome greater than 500000 there are two loan lisitngs for businesses. The StatedMonthlyIncome might be large because it is representing income for the whole business and not individuals. These are likely true outliers.

This plot shows both the upper credit score in red and the lower credit score in blue. The trends are very similar so I will only keep the upper range scores.


## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 679.0 699.0 704.5 739.0 899.0 566
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 439.0 679.0 699.0 705.3 739.0 899.0
There are a few outliers with very low credit scores that negatively skewed the data slightly. After removing those with lower values (less than 400), the variable seemed normally distributed. I would think that the lower credit score is consistent with the fact that many of these loans were targeted for debt consolidation. I would be curious to find out if there is a correlation between low credit score and bad loans though.


## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8472
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.057 0.086 0.095 0.121 1.042 8472
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.374 0.469 0.482 0.566 3.164 8472
While most of the DTI are at 0.2, we are seeing outliers with big numbers of DTI. They are likely to be real because of the nature of our dataset so I will keep them. Data transformation can be used to normalize the data.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6300 8316 12000 35000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.602 3.799 3.794 4.079 4.544
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.62 63.25 79.37 85.13 109.50 187.10
The distribution of loan original amount appear “spiky” with specific peaks presumably representing the amount for a specific type of loan. However, It looks to me that the original data appeared positively skewed and log transformation appear to help normalizing the dataset.
Bivariate Plots Section
Here I will focus on the following questions:
I. Which numeric variable has a correlation with APR?
II. How does APR change over time and other variables?
III. What are the characteristics of the current, completed and bad loans?
I. Which numeric variable has a correlation with APR?
To reduce computing time for initial analysis I created a smaller data set with only numeric variables and randomly selected ten thousand loan listings to generate correlations plots of those numeric variables against each other using ‘’’ggpairs’‘’. I also generated new numeric columns with modified values (’‘’log10’‘’ or’‘’sqrt’’’) to normalized some of the variables.
## BorrowerAPR EmploymentStatusDuration CreditScoreRangeUpper
## Min. :0.00653 Min. : 0.00 Min. : 19.0
## 1st Qu.:0.15629 1st Qu.: 26.00 1st Qu.:679.0
## Median :0.20984 Median : 67.00 Median :699.0
## Mean :0.21898 Mean : 96.06 Mean :704.5
## 3rd Qu.:0.28386 3rd Qu.:137.00 3rd Qu.:739.0
## Max. :0.51229 Max. :755.00 Max. :899.0
## NA's :7600 NA's :566
## DebtToIncomeRatio StatedMonthlyIncome LoanOriginalAmount sqrt_ESD
## Min. : 0.000 Min. : 0 Min. : 1000 Min. : 0.000
## 1st Qu.: 0.140 1st Qu.: 3198 1st Qu.: 4000 1st Qu.: 5.099
## Median : 0.220 Median : 4667 Median : 6300 Median : 8.185
## Mean : 0.276 Mean : 5604 Mean : 8316 Mean : 8.629
## 3rd Qu.: 0.320 3rd Qu.: 6812 3rd Qu.:12000 3rd Qu.:11.705
## Max. :10.010 Max. :1750003 Max. :35000 Max. :27.477
## NA's :8472 NA's :7600
## log10_SMI sqrt_DTIR
## Min. :0.000 Min. :0.000
## 1st Qu.:3.505 1st Qu.:0.374
## Median :3.669 Median :0.469
## Mean :3.615 Mean :0.482
## 3rd Qu.:3.833 3rd Qu.:0.566
## Max. :6.243 Max. :3.164
## NA's :8472

BorrowerAPR have some small but significant negative correlation with CreditScoreRangeUpper (-0.442) and LoanOriginalAmount (-0.325) but not with EmploymentStatusDuration, DebtToIncomeRatio or StatedMonthlyIncome. Additionally, LoanOriginalAmount has small but significant positive correlation with CreditScoreRangeUpper(0.344) and StatedMonthlyIncome(0.38). None of the modified variables showed a significant improvement in correlation.

##
## Pearson's product-moment correlation
##
## data: working_prosper_loan$BorrowerAPR and working_prosper_loan$CreditScoreRangeUpper
## t = -159.41, df = 112470, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4340443 -0.4245100
## sample estimates:
## cor
## -0.4292891
Here we can see the negatively trend between high credit score and borrower APR with a pearson’s coefficient of -0.429 for the entire dataset.

##
## Pearson's product-moment correlation
##
## data: working_prosper_loan$BorrowerAPR and working_prosper_loan$LoanOriginalAmount
## t = -114.48, df = 113040, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3275368 -0.3170890
## sample estimates:
## cor
## -0.3223227
We see that higher loan amounts tend to correlate with lower APR. This is specifically true when the loan amount is large. The discrete horizontal lines are due to the fact that loan amounts are given out at a specific number. The pearson’s coefficient for the entire dataset is -0.322.
The two major factors that correlate with APR in this dataset are CreditScoreRangeUpper and LoanOriginalAmount, which makes these two variables top candidates for building a model. I would also like to point out that StatedMonthlyIncome might also have a very mild effect on BorrwerAPR but the coefficient was very low at -0.165. I was surprised to find that DebtToIncomeRatio was not correlated with BorrowerAPR. I suspect that this factor is ignored because many of the loans are used for debt consolidation.
II. How does APR change over time and other variables?

Generally a line plot would be the best to visualize trends over time. However, because we are looking at individual loan listings, a scatter plot turned out to show the APR trend more clearly. We see nicely the upper bound and lower bound of APR over time. For example, during the first few months of data collection, the APR range from 0.03% to slightly over 0.5%. On the other hand, from 2011 to 2014, the lower bound is around 0.06% and the upper bound is around 0.36%. The extreme high and low rates also appear to be rates prior to 2009. The number of loan listings at a specific time is also highlighted by the different shades of grey, with darker dots representing higher number and vice versa.
The flactuation of rates at different times poses an inconsistency in which a high APR at one time could be only moderate in another time. The difference in the upper and lower bound of the APR could explain why the correlation between APR and other variables were not stronger.
##
## Pearson's product-moment correlation
##
## data: ListingCreationDateSubset$BorrowerAPR and ListingCreationDateSubset$CreditScoreRangeUpper
## t = -134.63, df = 28293, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6319356 -0.6177307
## sample estimates:
## cor
## -0.6248849
When I limit the dates to a period of time with similar upper and lower bound in APR (May 2006 to Nov. 2008), I see an increase in correlation coefficient to -0.625.
Another trend that I observed was that between Mar. 2011 and Jul. 2012, BorrowerAPR were offered at very fixed rates, indicated by the blank spaces flanked by dark dotted lines. This trend is very different than other times during the data collection.
The magenta dotted line marks the 0.36% APR which represented the second sharp peak observed in the BorrowerAPR histogram in the previous section. There is a period from Dec. 2010 to Dec. 2012 where 0.36% was offered extensively that could account for that peak.
Next I will look into how the APR trend is affected by other categorical variables.

## # A tibble: 3 × 4
## Term mean median n
## <fctr> <dbl> <dbl> <int>
## 1 12 0.2162232 0.22189 1614
## 2 36 0.2195722 0.20984 87199
## 3 60 0.2170339 0.20987 24228
Here we see that 12-month term has slightly higher median than the other two terms but 36-month term has the higest mean (green dots). Overall the stats are very similar, suggessting that terms don’t affect the APR much.

12-month and 60 month loans were launched only after end of 2011. 12-month loans discontinued after 2013.

## # A tibble: 3 × 4
## LoanStatus mean median n
## <fctr> <dbl> <dbl> <int>
## 1 Completed 0.2088913 0.195010 38239
## 2 Current 0.2139750 0.205240 55730
## 3 Bad_loan 0.2538349 0.256385 19072
Here we see that listings in the Bad_loan category has higher APR than the other two categories. One possible explanation for this difference is that the bad loans were issued only during the time when the rate was higher. Alternatively, it could also be that loans with higher APR tend to end up being bad loans. So I will look into how these loan types were distributed over time.

Bad loans ranged from the beginning to the end of the data collection time. There were also more loan listings with lower APR in the completed loans than bad loans, suggesting a correlation between high APR and bad loans.

## # A tibble: 21 × 3
## ListingCategory median n
## <fctr> <dbl> <int>
## 1 PersonalLoan 0.176260 2395
## 2 NotAvailable 0.185250 16940
## 3 Boat 0.204620 85
## 4 BabyAndAdoption 0.204930 196
## 5 DebtConsolidation 0.206260 57624
## 6 StudentUse 0.214505 756
## 7 LargePurchases 0.218580 863
## 8 Business 0.219450 7157
## 9 HomeImprovement 0.223620 7388
## 10 Motorcycle 0.230825 304
## # ... with 11 more rows
BorrowerAPR differed depending on the ListingCategory. Personal loan and Not available are the two categories that has the lowest APR, much less than the overall APR median (red dashed line); whereas Household expenses and Cosmetic procedures are the two categories with the highest APRs, way above the overall median.

When we split the time analysis data into listing categories, we see the following: 1. Loans in personal loan and student use category were once offered but discontinued in the middle of the data collection period. 2. Loans in debt consolidation, home improvement, business, auto and other category were offered consistently throughout the data collection period 3. There were a series of new loans offered after 2012, among those the more popular ones include Medical or dental, wedding loans, household expenses, etc. The medians of these different types of loans vary drastically. 4. Most of the Not available listings were collected before 2008 and all others were collected only after 2008, suggesting that this variable was not collected until after 2008.
Combining the two plots together we can also find out that the rates before 2008 have lower top and bottom limits, which might explain why the Not available listing has lower median. Similarly, there were more personal loans with low APR given out, resulting in a lower median.

Maine and Iowa have the lowest APRs (with outliers) and Arkansas and Alabama have the highest APRs. At first I thought it was due to the state usury limit, but after checking this article I didn’t find a correlation. For example, the legal rate of interest for ME, AR and AL are all 6%. Overall, the BorrowerAPR varies from state to state.

When we split the time series data into borrower states we can see the following, 1. Loans offered to IA, ME and ND were discontinued after 2009. In contrast, loans offered to SD only started after 2009. 2. Early discontinuation of loans in ND and IA could count for the missing peak at 0.36% in the APR histogram in previous section. 3. There seems to be an increase in issueing many more loans after 2012 in most of the states. 4. The NA listings seem to exist only prior to 2008, similar to Listing Category. 5. Before 2008, BorrowerAPR seemed drastically different from state to state (eg.CA vs CO). However, after 2008 the high and low limits of BorrowerAPR are very similar.
When we look at this plot we can also see individual reasons for why the median BorrowerAPR is different for each state. For example, the low rates of Maine and Iowa are due to the fact that the loans were issued at a time when the rates were low and discontinued when the rates went up.

Judge and Doctor receive the lowest APR. Nurse’s Aide and Teacher’s Aide receive the highest APR. Again we see that BorrowerAPR varies among different occupations adn seems to be related to the socialeconomic status of the borrower.

The data collection for Occupation seems quite completed with most of the levels covering from 2006 to 2014. There are two groups of NAs at the beginning (2006 - 2008) and end (2014) of the data collection time.

Part-time and Full-time status have lowest APRs whereas Other and Not employed have highest APRs.

Here we see again that the definition of levels in this category is confusing. There is almost no data for “Employed” before 2010, which was probably classified as “Full-time” based on the density (shades of gray) of the loan listings. There is also decreased number of loans in “part-time” and “retired” after 2011 and it’s not clear if the company decided not to offer loans to these categories or were they included in “others”, another category that began to accumulate data only after mid-2010.
One thing that I noticed is that listings after 2009, with borrowers that are not employed, consistently had higher APRs on the lower range, set arbitrarily at 0.1% (pink dotted line) when compared to the other levels in this variable. Additionally, there seems to be an increase of issueing loans to self- employed borrowers after 2011.
In this section we look into how loan listing were offered regionally, over time, and to what type of borrowers. We learn that BorrowerAPR varies from level to level in ListingCategory, BorrowerState, Occupation, LoanStatus and EmploymentStatus but remains similar in Term. Furthermore, we learn that data points in Occupation, Term and LoanStatus are collected consistently throughout the data collection period but not in BorrowerState, ListingCategory and EmploymentStatus, resulting in more missing values in the later variables.
III. Characteristics of bad loans
In previous section we see that BorrowerAPR is higher in the Bad_loan level so I want to look further into other variables that show a difference to futher characterize the bad loan category.

## # A tibble: 3 × 3
## LoanStatus median mean
## <fctr> <dbl> <dbl>
## 1 Completed 699 704.6653
## 2 Current 719 717.8541
## 3 Bad_loan 679 664.9343
Both mean and median is lower for Bad_loans than for Completed and Current loans.

Here we can see that the low credit scores are from between 2006 to 2007.

A zoom-in of CreditScoreRange shows that Current loans has the highest median of credit scores followed by Completed then Bad_loan.

We know from previous session that the distribution is very positively skewed. It’s hard to see the bulk of the data because of the outliers so I will zoom-in first.

## # A tibble: 3 × 4
## LoanStatus median mean n
## <fctr> <dbl> <dbl> <int>
## 1 Completed 0.19 0.2641048 35490
## 2 Current 0.23 0.2619848 51712
## 3 Bad_loan 0.22 0.3425138 17367
Current and Bad_loan have similar medians but Bad_loan has a much higher mean, suggesting that it might have more data points with higher DTI ratio. To test that, I decide to look at the summary of data points with DTI ratio greater than one.
## # A tibble: 3 × 4
## LoanStatus median mean n
## <fctr> <dbl> <dbl> <int>
## 1 Completed 3.02 5.067957 346
## 2 Current 1.67 3.526225 151
## 3 Bad_loan 5.10 5.734488 301
When I look at median and mean of DTI ratios greater than one, I see that Bad_loan has higher median and mean than the other two categories.


## # A tibble: 3 × 4
## LoanStatus median mean n
## <fctr> <dbl> <dbl> <int>
## 1 Completed 4416.667 5326.269 38239
## 2 Current 5166.667 6155.481 55730
## 3 Bad_loan 3750.000 4549.716 19072
Bad_loan has the smallest median and mean for StatedMonthlyIncome, suggesting that borrowers in this group tend to have less monthly income.

## # A tibble: 3 × 4
## LoanStatus median mean n
## <fctr> <dbl> <dbl> <int>
## 1 Completed 4500 6201.174 38239
## 2 Current 10000 10346.692 55730
## 3 Bad_loan 4500 6620.950 19072
Current loans have the highest median and mean followed by Completed and Bad_loans. This could be because both Completed and Bad_loans have more data points from earlier times when the loan amount was less. To check it I will only look at data points that are after 2009-08-01.

## # A tibble: 3 × 4
## LoanStatus median mean n
## <fctr> <dbl> <dbl> <int>
## 1 Completed 5000 7247.232 13436
## 2 Current 10000 10345.499 55700
## 3 Bad_loan 5000 7137.587 7131
It doesn’t appear to be the case as advancing the listing creating date only increase a little of the median and the mean of Completed and Bad_loans but they didn’t catch up with Current loans.
Multivariate Plots Section
I would imagine that for a loan company, what they want to see least is the occurrence of bad loans. Therefore in this section, I would like to focus on looking at what are the characteristics of bad loans from a few different angles.
I. How is LoanStatus distributed on the APR vs. CreditScoreUpper scatterplot?
From previous section we saw that BorrowerAPR and CreditScoreUpper had the best negative correlation so I would like to find out how the distribution of LoanStatus is within this correlation.

At this resolution, we see that both the Completed and Bad_loan classes have a more diverse distribution with no particular pattern. The Current class is more centered on the upper half of the distribution, indicating that current loans have a higher requirement of CreditScoreRange.
II. What is the trend of average BorrowerAPR of Bad_loans over time?
We saw that BorrowerAPR is higher in the Bad_loan category. Did this happen throughout the data collection time or was it because of some specific period that had a very high rate? To find out, I will look at the trend of BorrowerAPRs of different LoanStatus over time.

Despite a similar trend to the Completed listings, the Bad_loan listings had an overall higher BorrowerAPR all the time.
Is this true when we split the data into different categorical variables?

After splitting the data into subgroups, some levels with few data points such as boat or RV did not produce the curves or had curves with high deviation ( overfitting). When we focus on levels with sufficient data points we see that all of them had Bad_loan listing with higher BorrowerAPR just like we saw earlier.

There seemed to be more variation of LoanStatus pattern but overall the Bad_loan listings have higher BorrowerAPRs with the exception of AK.

Even though the difference is not as obvious after splitting, most levels in Occupation have higher BorrowerAPR in Bad_loan listings with the exception of flight Attendant.
Overall the higher BorrowerAPR of Bad_loans persisted over time across most of the categorical variables.
III. What is the percent of bad_loans in different listing categories
across different states?
I am curious to see if there are spicific traits for Bad_loan listings. For example, is there a certain listing category type in a specific state that has a higher chance to become a Bad_loan? or, is there an occupation within a listing category with higher percentage of Bad_loan?
## Source: local data frame [636 x 5]
## Groups: ListingCategory, BorrowerState [636]
##
## ListingCategory BorrowerState LoanStatus count ratio
## <fctr> <fctr> <fctr> <int> <dbl>
## 1 NotAvailable Bad_loan 1623 0.29785282
## 2 NotAvailable CA Bad_loan 1030 0.47884705
## 3 DebtConsolidation CA Bad_loan 700 0.09205681
## 4 NotAvailable TX Bad_loan 534 0.53993933
## 5 NotAvailable GA Bad_loan 469 0.50214133
## 6 NotAvailable IL Bad_loan 401 0.47176471
## 7 DebtConsolidation FL Bad_loan 382 0.11225389
## 8 DebtConsolidation IL Bad_loan 332 0.10946258
## 9 DebtConsolidation NY Bad_loan 302 0.08079187
## 10 DebtConsolidation GA Bad_loan 300 0.13215859
## # ... with 626 more rows

I first grouped the data by ListingCategory then by BorrowerState. I calculated the ratio of each levels in LoanStatus. I then selected a table with only Bad_loan listings and use the tile plot to visualize the ratio of Bad_loans in color with the corresponding number of Bad_loan listings in white.
As shown in this plot, each tile represent a ListingCategory type (x-axis) in a specific state (y-axis). Empty tiles indicate that there is no such subcategory or that there is no Bad_loan listings in that subcategory.
A few trends can be observed from the plot:
All light blue tiles (those with higher percentage of bad loan) has relatively small sample size, suggesting that these are events that most likely happened by chance and not due to the possibility that a specific category in a state is more prone to result in bad loans. However, the State of SD seemed to have a higher number of light blue tiles that I would recommend to pay a closer attention.
The left side of the figure has less numbers of listings than the right side of the figure presumably due to the fact that these categories were introduced later in the dataset.
DebtConsolidation has mostly dark blue tiles, indicating that the company is doing a good job in managing their main sector of business.
PersonalLoan and StudentUse has more lighter blue tiles than other categories which might explain why these two categories were discontinued.
Taken together, this will be a good figure to use for following up on the performance of the loans in a more detailed manner.